80 research outputs found

A Comparison of Feature-Based and Neural Scansion of Poetry

Author: Agirrezabal Manex
Alegria Iñaki
Hulden Mans
Publication date: 01/01/2017

Automatic analysis of poetic rhythm is a challenging task that involves linguistics, literature, and computer science. When the language to be analyzed is known, rule-based systems or data-driven methods can be used. In this paper, we analyze poetic rhythm in English and Spanish. We show that the representations of data learned from character-based neural models are more informative than the ones from hand-crafted features, and that a Bi-LSTM+CRF model produces state-of-the-art accuracy on scansion of poetry in two languages. Results also show that the information about whole word structure, and not just independent syllables, is highly informative for performing scansion. Comment: RANLP 201

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Semantikan oinarritutako bilaketak: Kyoto proiektua

Author: Alegria Iñaki
Rigau German
Publication venue: Euskal Herriko Unibertsitatea = Universidad del País Vasco
Publication date: 01/01/2010

Semantics-based search: the Kyoto project. In the digital management of documentation, the use of the text itself can be very interesting, in addition to the descriptors. Many descriptors are also text. The use of linguistic engineering techniques opens up new options for accessing information from these databases: multilingual access, semantic grouping, access based on similarity, question-answer systems, information inference, etc. This paper looks in more detail at the possibilities based on semantics, setting out the research areas being developed by the authors as part of the European Kyoto project.

E-LIS

Strategies to develop Language Technologies for Less-Resourced Languages based on the case of Basque

Author: Alegria Iñaki
Artola Xabier
Díaz De Ilarraza Arantza
Sarasola Kepa
Publication venue: HAL CCSD
Publication date: 25/11/2011

Over 23 years, the IXA group has developed a basic set of resources, tools and applications for Basque, following an initial strategy that has been adapted to technological changes. We think that our strategy and experience can serve as a reference for other less-resourced languages. According to a six-level classification of the world's languages, we estimate that this strategy may be useful for several hundred languages: those that have developed a written standard but are still beginners in Human Language Technology.

ArtXiker - @HAL

A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity

Author: Alegria Iñaki
Gamallo Otero Pablo
Neves Marco
Pichel Campos José Ramón
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2020

This is an Accepted Manuscript of an article published by Taylor & Francis in Journal of Quantitative Linguistics on 01 Mar 2020, available online: http://www.tandfonline.com/10.1080/09296174.2020.1732177 The aim of this paper is to apply a corpus-based methodology, based on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three languages. The three historical corpora have been constructed and collected with the closest spelling to the original, balanced between fiction and non-fiction. This methodology has been applied to measure the historical distance of Galician with respect to Portuguese and Spanish, from the Middle Ages to the end of the 20th century, both in original spelling and in automatically transcribed spelling. The quantitative results are contrasted with hypotheses drawn from experts in historical linguistics. Results show that Galician and Portuguese were varieties of the same language in the Middle Ages and that Galician has converged and diverged with Portuguese and Spanish since the last period of the 19th century. In this process, orthography plays a relevant role. It should be pointed out that the method is unsupervised and can be applied to other languages. This work has received financial support from DOMINO project [PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE]; eRisk project [RTI2018-093336-B-C21]; the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08, Consolidation and structuring of Groups with Growth Potential: 745ED431B 2017/39) and the European Regional Development Fund (ERDF).

Repositorio Institucional da Universidade de Santiago de Compostela

A spelling corrector for Basque based on morphology

Author: Aduriz Itziar
Alegria Iñaki
Artola Xabier
Ezeiza Nerea
Sarasola K.
Urkia Miriam
Publication venue: 'Oxford University Press (OUP)'
Publication date: 22/04/2021

This paper describes the components used in the elaboration of the commercial Xuxen spelling checker/corrector for Basque. Because Basque is a highly inflected and agglutinative language, the spelling checker/corrector has been conceived as a by-product of a general-purpose morphological analyser/generator. The spelling checker/corrector performs morphological decomposition in order to check misspellings and, to correct them, uses a new strategy which combines the use of an additional two-level morphological subsystem for orthographic errors with the recognition of correct morphemes inside the word-form during the generation of proposals for typographical errors. Due to the late standardization of Basque, Xuxen is intended as a useful tool for the standardization of present-day written Basque.

Diposit Digital de la Universitat de Barcelona

Teknologia garatzeko estrategiak baliabide urriko hizkuntzetarako: euskararen eta Ixa taldearen adibidea

Author: Aduriz Itziar
Alegria Iñaki
Artola Xabier
Díaz De Ilarraza Arantza
Sarasola Kepa
Publication venue: HAL CCSD
Publication date: 01/06/2011

The article begins by presenting several figures that show the situation of the Basque language, and then proposes a classification of the world's languages according to their presence on the Internet and in language technology. The body of the article presents the work done by the Ixa group in the field of automatic processing of Basque, identifying its seven main milestones and describing the strategy that has guided this development. It is argued that this strategy can serve as a reference for 190 languages that, according to the proposed classification, have no language technology resources but do have a minimal significant presence on the Internet.

ArtXiker - @HAL

TweetMT: a parallel microblog corpus

Author: Alegria Iñaki
Aranberri Nora
España-Bonet Cristina
Gamallo Pablo
Gonçalo Oliveira Hugo
Martinez Garcia Eva
Toral Antonio
San Vicente Iñaki
Zubiaga Arkaitz
Publication venue: European Language Resources Association

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

Warwick Research Archives Portal Repository

Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español

Author: Alegria Iñaki
Aranberri Nora
Fresno Víctor
Padró Lluís
Gamallo Pablo
San Vicente Iñaki
Turmo Borras Jorge
Zubiaga Arkaitz
Publication date: 01/01/2013

This article presents an introduction to the Tweet-Norm 2013 task: description, corpora, annotation, preprocessing, submitted systems, and results obtained. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

Massively multilingual accessible audioguides via cell phones

Author: Alegria Iñaki
Astigarraga Aitzol
Cortes Itziar
Garaio Manex
Leturia Igor
Sarasola Kepa
Publication venue: European Association for Machine Translation
Publication date: 01/01/2018

Bidaide is a web service that allows the visitors of a museum, route or building to read or listen to explanations about the visited place on their own mobile phone and in their own language. The visitor can access the explanations in various ways: by scanning QR codes located at the site, by GPS positioning (on outdoor routes), or by automatic Bluetooth proximity activation. This makes it accessible for people with low or no vision. In addition, the platform offers the manager of the visited site the most advanced language resources to create the texts and audio of the explanations in many languages.

Repositorio Institucional de la Universidad de Alicante